ABSTRACT

Embedded systems are distinguished from general-purpose
computers in that they consist of special-purpose hardware
and software optimized for a specific task. They are
pervasive in Army systems, appearing in soldier radios,
sensor systems, vehicle control, communication systems, and
many other applications. This paper focuses on
multiprocessor embedded systems targeted towards signal,
image, and video processing applications requiring large
computing power and having real-time performance
requirements. As transistor sizes shrink, interconnects
represent a significant bottleneck for embedded systems
designers. Several groups are researching optical
interconnects to cope with this trend. Optical interconnects
enable new system architectures. These new architectures in
turn require new methods for high-level application mapping
and hardware/software co-design. In this presentation, we
discuss high-level scheduling and interconnect topology
synthesis techniques for embedded multiprocessors. We focus
on designs that are streamlined for one or more digital
signal processing (DSP) applications. That is, we seek to
synthesize an application-specific interconnect topology for
a multiprocessor DSP design. We show that flexible
interconnect topologies that allow single-hop communication
between processors offer advantages for reduced power and
latency.

We have previously shown that multiprocessor scheduling
algorithms can deadlock in the general case of a topology
graph that is not strongly connected, or if communication is
limited to a single hop. We have also demonstrated an
efficient algorithm that can be used in conjunction with
existing scheduling algorithms for avoiding this deadlock
[1]. In this presentation we discuss the advantages of
performing application scheduling and interconnect synthesis
jointly, and present a probabilistic scheduling/interconnect
algorithm utilizing graph isomorphism to pare the design
space. We demonstrate the performance advantages that an
application-specific interconnect topology can produce for
several DSP benchmarks.

1. OPTICAL INTERCONNECTS

In recent years, optics have played an increasing role in
multiprocessor systems. Commercial high-performance
computers now use fiber ribbons to connect multiple
processing nodes. Other examples include storage area
networks using Fibre Channel, and optical clock distribution
to reduce clock skew across a chip. Programs such as the
DARPA VLSI Photonics program [?] are pushing to integrate
photonics technology on a single chip. Intel is currently
backing an effort to bring “fiber-to-the-processor” [?]. The
idea is to break the processor-to-cache bottleneck by using
an optical waveguide integrated on the processor chip.

2. CONNECTION TOPOLOGIES

Electrically connected multiprocessor systems generally have
a regular interconnection pattern, due to the physical
constraints imposed by two-dimensional circuit board layout.
Some examples include ring, mesh, bus, and hypercube
interconnect topologies. Using these topologies,
communication between remote processors requires multiple
hops, which increases both latency and power, and increases
contention throughout the network.

In contrast, optically connected multiprocessors,
particularly those utilizing free-space optics and three
dimensions, are free to utilize arbitrarily irregular
interconnection networks. Once the signal is in the optical
domain, there is very little attenuation, so the energy
required to transmit a unit of data is essentially
independent of distance. The required energy instead is a
function of the number of electrical-to-optical conversions
that must be performed [?], which in turn is determined by
the number of hops. With single-hop schedules the overhead
associated with routing data through intermediate processors
is eliminated. Furthermore, due to the flexibility of the
communication medium, it is generally possible to avoid
multi-hop communication operations by simply activating
direct communication channels between the source and
destination processors. Together, these properties make it
desirable to limit the number of hops per communication
operation when exploring configurations (interconnection
patterns and task graph mappings) for an optically
connected, embedded multiprocessor.

Application   N   Δ(E) (%)   Δ(M) (%)
FFT1          7   16         8
Karp10        6   24         4
Irr           8   16         (2)
Qmf4          7   32         3
NN16-3-4      8   58         2
Sum1          6   1          4
Laplace       7   4          (3)
FFT2          7   12         2

Table 1: Reduction in communication energy (Δ(E)) and
makespan increase (Δ(M)) of single-hop schedule over
three-hop schedule. Values in parentheses indicate a
makespan decrease.
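
As a rough illustration of the energy argument in this
section, the following Python sketch assumes that every hop
costs one electrical-to-optical conversion and that every
transfer moves the same number of bits. The function names
and the hop counts in the example are hypothetical and are
not taken from our benchmarks.

```python
def communication_energy(hops_per_transfer, energy_per_conversion=1.0):
    """Total communication energy of a schedule.

    Assumption: each hop needs exactly one electrical-to-optical
    conversion, and all transfers carry the same number of bits.
    """
    return energy_per_conversion * sum(hops_per_transfer)


def energy_reduction_percent(multi_hop, single_hop):
    """Percentage energy saved by the single-hop schedule."""
    e_multi = communication_energy(multi_hop)
    e_single = communication_energy(single_hop)
    return 100.0 * (e_multi - e_single) / e_multi


if __name__ == "__main__":
    # Hypothetical hop counts for five transfers (illustration only).
    multi_hop = [1, 1, 2, 3, 1]    # some transfers are routed via other processors
    single_hop = [1, 1, 1, 1, 1]   # every transfer uses a direct channel
    print(energy_reduction_percent(multi_hop, single_hop))  # 37.5
```

Under this model the saving depends only on how many
transfers would otherwise need more than one hop, not on the
physical distance between processors.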

3. SCHEDULING AND INTERCONNECT SYNTHESIS ALGORITHMS

In order to quantify this effect, we scheduled several DSP
benchmark applications using our modified scheduling
technique, which takes the number of hops as an input
parameter. We scheduled the benchmarks with hop constraints
of one hop and three hops, and compared the communication
energy required. For our purposes, we assumed all
communication tasks transferred the same number of bits, so
the energy cost of all interprocessor communication (IPC)
actors was equal. Table 1 shows the reduction in the
required communication energy for single-hop schedules over
three-hop schedules for the benchmark applications. For
these benchmarks, we found that any undesirable effect on
the makespan of the additional constraint for single-hop
schedules was very small, as can be seen in Table 1. In two
of the benchmarks (Irr and Laplace), the makespan was in
fact better (lower) when we limited the scheduler to single
hops.
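
To make the role of the hop constraint concrete, the sketch
below checks whether a given task-to-processor assignment
respects a maximum hop count on a given topology. It is only
an illustration and not our scheduling technique: the
directed-adjacency representation, the function names, and
the four-processor ring in the example are all assumptions.

```python
from collections import deque


def hop_distances(topology, source):
    """Breadth-first hop counts from `source` over a directed topology,
    given as a dict mapping each processor to the set of processors it
    can send to directly."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        p = queue.popleft()
        for q in topology.get(p, ()):
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return dist


def mapping_is_feasible(topology, ipc_edges, assignment, max_hops):
    """True if every producer/consumer pair can communicate within
    `max_hops` hops under the given task-to-processor assignment."""
    for producer, consumer in ipc_edges:
        src, dst = assignment[producer], assignment[consumer]
        if src == dst:
            continue  # same processor: no interprocessor transfer
        if hop_distances(topology, src).get(dst, float("inf")) > max_hops:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical unidirectional ring of four processors.
    ring = {0: {1}, 1: {2}, 2: {3}, 3: {0}}
    edges = [("a", "b")]
    placement = {"a": 0, "b": 2}
    print(mapping_is_feasible(ring, edges, placement, max_hops=3))  # True
    print(mapping_is_feasible(ring, edges, placement, max_hops=1))  # False
```

In this example the single-hop constraint can only be met by
changing the placement or by adding a direct link from
processor 0 to processor 2, which is the kind of choice a
joint scheduling and interconnect synthesis step has to make.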

We present a genetic algorithm for synthesizing efficient
interconnection networks for embedded multiprocessors. The
algorithm works in conjunction with a list scheduling
algorithm to jointly optimize both the schedule and the
interconnect topology. The algorithm is able to account for
different distributions of local vs. global (long)
interconnect routing tracks via a processor fanout
constraint.

Figure 1: Comparison of number of graphs (top) with |P| = 5
nodes to those that are isomorphically unique (bottom).
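
One way a processor fanout constraint could enter a genetic
search over topologies is sketched below. This is a
hypothetical mutation operator, not the operator used in our
algorithm; it assumes that links are directed
(source, destination) pairs and that fanout means the number
of outgoing links per processor.

```python
import random


def fanout_ok(links, num_procs, max_fanout):
    """True if no processor has more than `max_fanout` outgoing links."""
    out = [0] * num_procs
    for src, _dst in links:
        out[src] += 1
    return max(out, default=0) <= max_fanout


def mutate(links, num_procs, max_fanout, rng):
    """Toggle one randomly chosen directed link. The change is kept only
    if the resulting topology still satisfies the fanout constraint."""
    src, dst = rng.sample(range(num_procs), 2)  # two distinct processors
    candidate = set(links) ^ {(src, dst)}       # add the link, or remove it
    if fanout_ok(candidate, num_procs, max_fanout):
        return candidate
    return set(links)


if __name__ == "__main__":
    rng = random.Random(0)
    topology = set()
    for _ in range(50):
        topology = mutate(topology, num_procs=5, max_fanout=2, rng=rng)
    print(sorted(topology))
```

Rejecting infeasible mutations keeps every candidate inside
the constrained design space; a penalty term in the fitness
function would be an alternative design.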

Figure 1 compares the number of possible graph labelings
(for a graph with 5 nodes and varying numbers of edges) with
the number of isomorphically unique graphs. By searching
only isomorphically unique topologies, our interconnect
synthesis algorithm pares the design space significantly and
searches more efficiently.
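
The size of this reduction can be reproduced for small cases
with a brute-force count. The sketch below assumes undirected
simple graphs, so its numbers need not match Figure 1 if the
topology graphs there are directed, and it uses exhaustive
relabeling rather than the isomorphism test in our algorithm.
The function name is hypothetical.

```python
from itertools import combinations, permutations
from math import comb


def count_graphs(n):
    """Return {m: (labeled, unique)} for undirected simple graphs on n
    nodes with m edges, where `unique` counts isomorphism classes."""
    pairs = list(combinations(range(n), 2))
    perms = list(permutations(range(n)))
    counts = {}
    for m in range(len(pairs) + 1):
        canonical_forms = set()
        for edges in combinations(pairs, m):
            # Canonical form: the smallest edge list over all relabelings.
            best = min(
                tuple(sorted(tuple(sorted((p[u], p[v]))) for u, v in edges))
                for p in perms
            )
            canonical_forms.add(best)
        counts[m] = (comb(len(pairs), m), len(canonical_forms))
    return counts


if __name__ == "__main__":
    for m, (labeled, unique) in count_graphs(5).items():
        print(m, labeled, unique)
```

For n = 5 this count gives 1024 labeled graphs in total but
only 34 isomorphically unique ones. Exhaustive relabeling is
practical only for small n, since the number of permutations
grows factorially.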

We evaluated our interconnect synthesis algorithm on several
DSP benchmark application graphs. We calculated how the
makespan improves as the maximum fanout constraint is
increased. This amounts to an area/performance tradeoff in
the system. We also compared the performance of systems with
topologies available with electrical interconnects vs.
optical interconnects. These topics will be described in the
presentation.